Statistical methods have been widely employed in recent years to grasp many language properties.
The application of such techniques has allowed an improvement of several linguistic applications,
which encompass machine translation, automatic summarization and document classification. In
the latter, many approaches have emphasized the semantic content of texts, as is the case of
bag-of-words language models. This approach has certainly yielded reasonable performance.
However, some potential features, such as the structural organization of texts, have been used in
only a few studies. In this context, we probe how features derived from textual structure analysis
can be effectively employed in a classification task. More specifically, we performed a supervised
classification aiming at discriminating informative from imaginative documents. Using a networked
model that describes the local topological/dynamical properties of function words, we achieved an
accuracy rate of up to 95%, which is much higher than that of similar networked approaches. A
systematic analysis of feature relevance revealed that symmetry and accessibility measurements are
among the most prominent network measurements. Our results suggest that these measurements could
be used in related language applications, as they play a complementary role in characterizing texts.


I. INTRODUCTION 

The ever-growing amount of documents available on the Web has
propelled the development of statistical natural language processing
methods in recent years. Examples of related applications trying to
“understand” unstructured data include machine translation [1], text
summarization [2-5], information retrieval [6, 7] and content
analysis [8]. An application of special importance for the
organization of electronic data is the classification task, which
automatically assigns one or more labels to a word, sentence,
paragraph or entire document [9-12]. Traditional textual
categorization methods usually serve to identify the relevance of
texts (e.g. whether a message is spam or not) or the meaning conveyed
by words and expressions [13-17]. Recent classification tasks,
however, have emphasized other textual aspects. For example, the
categorization of texts according to their polarity (e.g. positive or
negative) has become a relevant task for analyzing e.g. customer
reviews [18] or global variations in mood via polarity analysis of
Twitter messages [19]. Note that most of these classification tasks
are dependent on text content, since the presence of one or more
specific words provides clues about the classes being inferred. While
the semantic content is crucial for the success of these applications,
the structure of texts might play an important role in classification
problems where the semantics of words is not crucial for the purpose
of categorization. This is the case of identifying the style of texts,
since documents on the same subject might display different writing
styles. In contrast to traditional semantic-based classification
tasks, in this paper we probe the relevance of textual structure to
provide useful features for text classification. More specifically, we
probe how textual structure depends on two distinct writing styles:
imaginative and informative prose. The structure and organization of
texts are studied via networked models, a well-known representation of
complex systems.

Networks are discrete models that represent the interrelations
between interacting agents in a complex system. Owing to its
simplicity and generality, the model has been employed to describe a
myriad of real and artificial systems [20]. Despite being very
different in nature, networks modelling distinct systems share several
structural patterns [21]. Of special relevance to this paper are the
networked models of language and texts, which have been useful to
unveil universal properties including the scale-free and small-world
phenomena [22]. In practical terms, networked models have been useful
to grasp several features of texts, such as quality [23],
complexity [24] and authenticity [25]. In this study, we used the
so-called word adjacency model, which is an approximation of text
networks formed by syntactical links [26]. Because the topological
analysis of word adjacency networks does not depend on the
interpretation of texts, it has been applied with relative success to
study the underlying structure of texts, even when textual content is
entirely unknown [25]. Here we applied this representation to
discriminate informative and imaginative prose. We have extended the
traditional model in a twofold fashion: (i) we analyzed the local
structure of particular nodes (words); and (ii) we considered novel
topological/dynamical measurements that are able to unfold the general
structure of the network of concepts. As we shall show, both
extensions provided competitive classification performance when
compared to traditional stylometry methodologies. In particular, we
have found that symmetries and accessibilities were the most important
network measurements. We believe that the proposed extended model
could be used to improve the performance of several related problems
where the textual structure plays a prominent role in the
characterization of documents [27].

This paper is organized as follows. In Section II, we 
present the word adjacency model. Section III describes 
the pattern recognition methods used to perform the 
classification. This section also presents a method for 
measuring feature relevance in a multivariate fashion. 
The results obtained with the proposed methodology are 
described in Section IV. Finally, Section V provides a 
perspective for further studies and improvements of the 
model. 


II. REPRESENTATION AND 
CHARACTERIZATION OF TEXTS AS 
NETWORKS 

In this section, we present how a text can be represented as a
complex network. We also briefly describe the main topological
network measurements employed to characterize the structure of
networks.


Modeling texts as complex networks 

A complex network can be represented as a graph, which is defined as
a set of nodes and edges [21]. A usual representation of a network is
the adjacency matrix $A = \{a_{ij}\}$. The elements are defined as:

$$a_{ij} = \begin{cases} 1 & \text{if there is a link } i \to j, \\ 0 & \text{otherwise.} \end{cases} \qquad (1)$$

In the text mining community, several networked representations of
texts have been proposed [28]. If the stylistic properties of texts
are relevant for the task being tackled, syntactical relations are
employed to establish the links between words [26]. If the application
requires the extraction of semantic features, words are connected
according to semantic relations, such as those present e.g. in
WordNet [29] or in free association graphs [30]. In this study, we aim
at grasping textual features that are independent of semantic content.
For this reason, we used a model that is able to capture stylistic
textual features [12, 31]. This model, henceforth referred to as the
word adjacency model, represents each distinct word as a node. The
edges are established between words appearing in the same context. It
has been shown that if one considers the context as the interval of
two adjacent words, most of the syntactical relations are
recovered [26]. This model has been successfully applied to study many
language applications and related systems [31].

To create a word adjacency network, some pre-processing steps are
usually applied. First, all punctuation marks, line breaks, spaces,
numbers and special characters are removed. We are mostly interested
in the relationship between words conveying semantic information. For
this reason, stopwords (or function words) can be optionally removed
from the analysis. In the next step, lemmatization is performed in
order to map words representing the same concept into the same node.
To assist the lemmatization process, the part-of-speech tag of each
word is extracted according to the procedure described in [32]. The
part-of-speech labelling is required to resolve ambiguities, because
the same word form might be mapped into distinct lemmas. After the
pre-processing step, each remaining distinct word becomes a node and
edges are established between adjacent words.
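The construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used in the paper: the stopword list `STOPWORDS` and the function name `word_adjacency_network` are hypothetical, tokens are taken to be lowercase alphabetic strings, and the lemmatization and part-of-speech tagging steps are omitted.

```python
import re
from collections import defaultdict

# Hypothetical, tiny stopword list used only for illustration.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}

def word_adjacency_network(text, remove_stopwords=True):
    """Return a directed, weighted adjacency map {word: {next_word: count}}."""
    # Keep alphabetic tokens only: drops punctuation, numbers and special characters.
    tokens = re.findall(r"[a-z]+", text.lower())
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    adjacency = defaultdict(lambda: defaultdict(int))
    # Link each word to the word that immediately follows it.
    for source, target in zip(tokens, tokens[1:]):
        adjacency[source][target] += 1
    # Make sure every word is a node, even if it has no outgoing edge.
    for t in tokens:
        _ = adjacency[t]
    return {word: dict(neighbors) for word, neighbors in adjacency.items()}

net = word_adjacency_network("The cat chased the mouse and the dog chased the cat.")
```

Because punctuation is removed before linking, edges in this sketch may cross sentence boundaries. Passing `remove_stopwords=False` keeps function words as nodes, which corresponds to the variant of the model that retains stopwords.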

Complex network measurements 

Currently, there are more than a hundred measurements employed to
characterize the topological structure of networks [33]. Some
measurements might depend not only on the structure, but also on a
dynamical process (e.g. random walks) occurring on the structure.
Below we briefly describe the measurements used in this study.

• Number of nodes ($V$): in a word adjacency network, the number of
nodes is the number of distinct words in the text. In other words,
the number of nodes is the vocabulary size of the pre-processed
text.

• Degree ($k$): the simplest connectivity measurement is the node
degree [34], which corresponds to the total number of edges connected
with node $i$. This measurement is defined for directed networks as
$k_i^{(in)} = \sum_j a_{ji}$ and $k_i^{(out)} = \sum_j a_{ij}$ for
in- and out-degree, respectively. If one considers the undirected and
unweighted version of the network, the degree
$k_i = \sum_j a_{ij} = \sum_j a_{ji}$ can be understood as the number
of distinct bigrams in which a given word appears. If one considers
edge weights, then the degree is proportional to the word frequency.

• Neighborhood connectivity ($N$): this measurement is defined as the
number of nodes that can be reached when, starting from the reference
node, walks of length $h$ are performed. Note that the traditional
degree measurement is recovered when $h = 1$.

• Clustering coefficient ($cc$): given a node $i$, the probability
that its neighbors are connected is called the clustering coefficient
($cc_i$) [35]. This measurement is defined as

$$cc_i = \frac{3 N_\Delta(i)}{N_3(i)}, \qquad (2)$$

where $N_\Delta(i)$ is the total number of triangles (i.e. cliques
comprising three nodes) connected with node $i$, and $N_3(i)$ denotes
the number of connected triples, which is defined as the number of
different connections between $i$ and each pair of nodes. This
measurement is traditionally used to quantify the local connectivity
of real-world networks [36]. In word adjacency networks, this
measurement has been applied to quantify the specificity of words
according to the number of distinct contexts in which they
appear [37].

• Betweenness centrality ($B$): to define this measurement, consider
that all pairs of nodes in the network are connected via shortest
paths [38]. The betweenness of a node $u$ is defined as being
proportional to the number of shortest paths that pass through node
$u$. More specifically,

$$B_u = \sum_{i} \sum_{j} \frac{\sigma(i, u, j)}{\sigma(i, j)}, \qquad (3)$$

where $\sigma(i, u, j)$ is the number of shortest paths between $i$
and $j$ that pass through node $u$ and $\sigma(i, j)$ is the total
number of shortest paths between $i$ and $j$. According to
equation (3), the betweenness centrality can be interpreted as the
network flow [39-41], which is a relevant quantity for the analysis of
the robustness of power-grid networks [42, 43]. When applied to the
analysis of text networks, this measurement has been interpreted as
being useful to quantify the generality of the contexts in which a
word appears [44], which is in part motivated by the use of this
measurement in community detection methods [45]. Unlike the clustering
coefficient, the betweenness centrality uses the global connectivity
information to quantify the specificity/generality of concepts [44].

• Closeness centrality ($C$): unlike the betweenness centrality, which
is based on the number of shortest paths, the closeness
centrality [46] uses the length of the shortest paths. If $d_{ij}$ is
the shortest distance between nodes $i$ and $j$, the closeness
centrality is calculated as $C_i = V^{-1} \sum_j d_{ij}$. Geodesic
paths have been reinterpreted in the context of text networks as a
measure of word relevance. Actually, a word is deemed relevant if it
is very frequent in the text or if it appears related to other very
frequent words [37].

• Eccentricity ($E$): this measurement quantifies the maximum geodesic
distance between the reference node and all other nodes [47].
Therefore, the maximum eccentricity value corresponds to the network
diameter. This measurement is calculated for each node $i$ as
$E_i = \max_j (d_{ij})$.

• Eigenvector centrality ($Ec$): the eigenvector centrality can be
understood as an extension of degree centrality [48], because the
relevance of the reference node relies both on the number and on the
relevance of its neighbors. Considering the adjacency matrix $A$, the
eigenvector centrality is defined as the eigenvector associated with
the leading eigenvalue. There are many linguistic applications that
use this centrality measurement. It has been applied, for example, in
the text summarization task in order to select the most relevant
extracts in texts modelled as graphs [4].

• PageRank ($Pr$): the PageRank is widely known to be part of Google's
web search [21, 49]. In text networks, this measurement has been
successfully applied e.g. to disambiguate word senses [50]. This
measure is based on the eigenvector centrality, and it is defined in
matrix terms as

$$Pr = \alpha A D^{-1} Pr + \beta \mathbf{1}, \qquad (4)$$

where $\alpha$ and $\beta$ are positive constants (conventionally
$\beta = 1$), $\mathbf{1}$ is the vector $(1, 1, 1, \ldots)^T$ and $D$
is a diagonal matrix represented as

$$D_{ij} = \begin{cases} \max\{k_i^{(out)}, 1\} & \text{if } i = j, \\ 0 & \text{otherwise.} \end{cases} \qquad (5)$$

In contrast with eigenvector centrality, PageRank considers a weighted
sum of the neighbors' importance reflecting the neighbors' degree. In
this way, the relevance associated with a node is proportionally
transferred to its neighbors.

• Accessibility ($A^{(h)}$): the accessibility is an extension of the
concept of neighborhood connectivity because it measures the effective
number of nodes reached at the $h$-th concentric level [51]. The
effective number of nodes accessed after $h$ steps is computed
considering the distribution of probabilities of access via
self-avoiding random walks. Mathematically, it is defined using the
Shannon entropy [52] of the probabilities of access at the $h$-th
concentric level:

$$A_i^{(h)} = \exp\left( - \sum_j p_{ij}^{(h)} \log p_{ij}^{(h)} \right), \qquad (6)$$

where $p_{ij}^{(h)}$ is the probability of a walker starting from $i$
to reach node $j$ in $h$ steps. In text networks, this measurement has
been applied to generate summaries and to identify keywords and
styles [2]. An example of the computation of accessibility is shown in
Fig. 1.

FIG. 1. Quantification of accessibility for the local neighbourhood of
the blue node. The probabilities near nodes represent the
probabilities of access in self-avoiding random walks of length
h = 1 (orange nodes) or h = 2 (green nodes). In the second
hierarchical level, the accessibility is lower than the total number
of nodes (green nodes). This occurs because the access to the second
level is uneven.

• Generalized accessibility ($Ag$): the generalized accessibility is
an extension of the accessibility that does not rely on a particular
length of walk. Instead, the probabilities of transition are computed
considering all possible lengths, which is implemented via the
definition of a modified random walk, the so-called accessibility
random walk [53]. This measurement is defined as

$$Ag_i = \exp\left( - \sum_j P_{ij} \log P_{ij} \right), \qquad (7)$$

where $P_{ij}$ is a quantity that depends on the probability of the
transition $i \to j$ considering walks of variable length. The matrix
$P$ representing the probability of transition is calculated as
$P = W / e$, where

$$W = \sum_{k=0}^{\infty} \frac{1}{k!} P_1^k \qquad (8)$$

and $P_1$ is the transition matrix of the one-step random walk. Note
that, according to equation (8), the highest weights are assigned to
the nearest nodes. The generalized accessibility has been applied e.g.
to identify influential spreaders in spatial networks [53].

• Symmetry ($S$): the symmetry concept is found in many real
systems [54, 55]. Symmetric properties can also occur in written texts
as a consequence of grammatical or stylistic constraints [59]. For
this reason, we have quantified this property in word adjacency
networks. To model such a property, some network measurements have
recently been created [56-59]. In this paper, we used the quantities
introduced in [58, 59], as they allow symmetric patterns to be
captured in a multi-scale fashion. The definition of the symmetry
measures relies upon the characterisation of hierarchic levels. The
hierarchic level $\Gamma_h(i)$ for a given node $i$ is the set
comprising all nodes $h$ hops away from $i$. The symmetry measures are
based on the accessibility measurement, because the same network
dynamics is taken for the analysis. In addition, the symmetry
measurement can be seen as a normalization of the accessibility. Thus,
using self-avoiding random walks, a node is considered to be symmetric
if the access to its neighbors (in a given hierarchic level) is
symmetric. The symmetry (or regularity) of the access is measured in
terms of the entropy:

$$S_h(i) = \frac{\exp\left( - \sum_{j \in \Gamma_h(i)} p_{ij}^{(h)} \log p_{ij}^{(h)} \right)}{|\Gamma_h(i)| + \sum_{r=0}^{h-1} \eta_r}, \qquad (9)$$

where $\eta_r$ denotes the total number of dead ends in the $r$-th
hierarchical level and $p_{ij}^{(h)}$ is the same quantity used to
define the accessibility in equation (6). There are two variations of
the quantity proposed in equation (9). The backbone symmetry ($Sb$), a
variation of the concept of radial symmetry, removes all edges between
nodes in the same hierarchical level. The merged symmetry ($Sm$), on
the other hand, is based on the concept of angular symmetry, which can
be obtained by merging linked nodes in the same hierarchical level. To
exemplify both variations of the symmetry concept, we show in Fig. 2
the transformations applied to a hierarchical neighborhood before the
computation of equation (9).


• Modularity ($Q$): a community structure is defined as a network
subgraph with a large number of intra-links and few edges connecting
it to the other nodes of the network. To quantify whether a network is
organized in communities, the modularity compares the number of
internal links with the expected value of the same quantity in an
equivalent random network [60]. This quantity is computed as

$$Q = \frac{1}{2M} \sum_i \sum_j \left( a_{ij} - \frac{k_i k_j}{2M} \right) \delta(c_i, c_j), \qquad (10)$$

where $M = \frac{1}{2} \sum_{ij} a_{ij}$ is the total number of edges
in the network, $c_i$ and $c_j$ are the communities to which nodes $i$
and $j$ belong, and

$$\delta(c_i, c_j) = \begin{cases} 1 & \text{if } c_i = c_j, \\ 0 & \text{otherwise.} \end{cases} \qquad (11)$$

Usually, word adjacency networks display low values of modularity. A
more consistent organization in communities can be found e.g. in
semantic networks such as WordNet [29].

FIG. 2. Example of merged and backbone patterns for the quantification of local symmetry. The fractions represent the
probabilities to reach a given node in a random walk of length h = 1 (orange) or h = 2 (green). To create the backbone
pattern, edges among nodes in the same hierarchical level are removed. In contrast, merged patterns are computed by merging
connected nodes into a single node.
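To make the accessibility of equation (6) concrete, the following Python sketch enumerates all self-avoiding walks of length $h$ from a reference node and takes the exponential of the entropy of the resulting access probabilities. It is an illustration under stated assumptions: the graph is undirected and stored as a hypothetical dictionary `toy_graph`, the walker chooses uniformly among unvisited neighbors, and walks that reach a dead end before $h$ steps simply contribute no probability. The function names are illustrative and exhaustive enumeration is only practical for small $h$.

```python
import math
from collections import defaultdict

def access_probabilities(graph, start, h):
    """Probability that a self-avoiding walker leaving `start` sits on each node after h steps."""
    probs = defaultdict(float)

    def walk(node, visited, steps_left, prob):
        if steps_left == 0:
            probs[node] += prob
            return
        options = [n for n in graph[node] if n not in visited]
        for nxt in options:  # uniform choice among unvisited neighbors
            walk(nxt, visited | {nxt}, steps_left - 1, prob / len(options))

    walk(start, {start}, h, 1.0)
    return dict(probs)

def accessibility(graph, start, h):
    """Exponential of the Shannon entropy of the access probabilities."""
    probs = access_probabilities(graph, start, h)
    entropy = -sum(p * math.log(p) for p in probs.values() if p > 0)
    return math.exp(entropy)

# Hypothetical toy graph: node 0 has two neighbors with unequal numbers of children.
toy_graph = {0: [1, 2], 1: [0, 3], 2: [0, 4, 5, 6], 3: [1], 4: [2], 5: [2], 6: [2]}
a1 = accessibility(toy_graph, 0, 1)
a2 = accessibility(toy_graph, 0, 2)
```

In this toy graph four nodes lie two steps away from node 0, but one of them is reached with probability 1/2 and the other three with probability 1/6 each, so `a2` equals the square root of 12 (about 3.46) rather than 4. The first level is accessed evenly, so `a1` equals 2.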


Characterization of texts with complex networks 

So far we have presented several topological/dynamical measurements
of complex networks. The objective here is to use these quantities to
characterise styles in texts. Note that several measurements are
locally defined, i.e. each node possesses a value. There are several
possibilities to use these local measurements to characterise the
networks. In this paper, we have used the following three distinct
methodologies:


• Global strategy without stopwords (GS): in this approach, we
summarise the local measurements to characterise the networks. The
most natural summarisation procedure is to take the average
$\langle X \rangle$, where $\langle X \rangle = V^{-1} \sum_i X_i$ and
$X$ is a local measurement. We also used the following quantities: the
standard deviation $\sigma(X)$, the median $\tilde{X}$, the maximum
value $\max(X)$ and the minimum value $\min(X)$. The only global
measurement, the modularity, was also considered in this strategy.
Following several approaches for grasping textual features with
networks, we have removed all stopwords before the formation of the
networks. These words were disregarded from the analysis because they
just serve to connect content words in the word adjacency model.


• Local strategy without stopwords (LS): in this approach, the value
of each measurement $X$ for each word becomes a feature. Similarly to
the GS, all stopwords are removed from the analysis. Because features
are defined for each word, global network measurements are not
considered in this case.


• Local strategy with stopwords (LSS): this is 
the same local approach adopted in the LS method. 
However, this variation also considers all stopwords 
in the analysis. 
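The difference between the global and local strategies can be sketched as follows. This is an illustrative fragment under stated assumptions: a local measurement is represented as a dictionary mapping words to values, the population standard deviation is used (the text does not specify which estimator was adopted), and a word absent from a text receives the value 0.0. The function names and the `degree` values are hypothetical.

```python
import statistics

def global_features(local_values):
    """Summarise a local measurement {word: value} into network-level features."""
    values = list(local_values.values())
    return {
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),   # population standard deviation (assumption)
        "median": statistics.median(values),
        "max": max(values),
        "min": min(values),
    }

def local_features(local_values, words):
    """Use the measurement of each selected word as a feature (0.0 if the word is absent)."""
    return [local_values.get(w, 0.0) for w in words]

# Hypothetical degree values for a four-word network.
degree = {"cat": 1, "chased": 2, "mouse": 1, "dog": 1}
summary = global_features(degree)
per_word = local_features(degree, ["dog", "bird"])
```

The global strategy yields one fixed-length vector per measurement regardless of vocabulary, whereas the local strategy requires fixing in advance the list of words whose values are used as features.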
III. PATTERN RECOGNITION AND
EVALUATION

In this section, we present the methodology for analyzing the
relationship between the texts and the categories (informative and
imaginative). More specifically, we describe the pattern recognition
methods employed and the methods to compute the quality of the
classification and the relevance of the proposed features (see
Section II).

Pattern recognition methods 

To study the relationship between complex network measurements and
text style, we used a feature selection algorithm and three different
supervised classifiers. The method used to select the features was the
information gain [61], which is a supervised attribute filter also
known as mutual information [62]. Given the random variables $X$ and
$Y$, the mutual information $I(X, Y)$ is computed as

$$I(X, Y) = \sum_{y} \sum_{x} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}, \qquad (12)$$

where $p(x)$ and $p(y)$ are probability functions and $p(x, y)$ is the
joint probability.

The information gain corresponds to the mutual information when $X$
comprises the values obtained for a given attribute and $Y$ is a
vector of corresponding classes. This technique is used to create a
ranking of relevance sorted in decreasing order. Thus, the most
relevant attributes, i.e. the ones with the highest values of
information gain, are selected to perform the classification. An
important characteristic of this method is that the attributes are
evaluated separately, i.e. the information of a given attribute does
not influence the others.
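The ranking by information gain can be illustrated with a short Python sketch of equation (12). It assumes that attribute values have already been discretised (continuous network measurements would first need binning, which is not shown) and that the logarithm is taken in base 2; both choices, the function names and the toy data are assumptions made for illustration.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """I(X, Y) in bits for two equally long sequences of discrete values."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2((c / n) / ((count_x[x] / n) * (count_y[y] / n)))
        for (x, y), c in count_xy.items()
    )

def rank_by_information_gain(features, labels):
    """Sort feature names by decreasing information gain; `features` maps name -> values."""
    gains = {name: mutual_information(vals, labels) for name, vals in features.items()}
    return sorted(gains, key=gains.get, reverse=True)

# Hypothetical toy data: attribute "a" determines the class, attribute "b" is unrelated to it.
labels = ["informative", "informative", "imaginative", "imaginative"]
features = {"a": [0, 0, 1, 1], "b": [0, 1, 0, 1]}
ranking = rank_by_information_gain(features, labels)
```

Each attribute is scored on its own, which mirrors the remark above that the information of one attribute does not influence the others.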

In our experiments, the following pattern recognition 
methods were used: 

• Nearest neighbors: the K nearest neighbors classifier (KNN)
considers the local neighborhood of the test instance [63]. Given a
test instance, the class chosen is the majority class in the set of
the K nearest neighbors in the training dataset. Further details
concerning this method can be found in [64].

• Classification and regression tree: this method represents the
patterns found in the dataset as a tree, a data structure storing
nested rules. Even though there are several tree algorithms, we chose
the classification and regression tree (CART) method [65] because it
is relatively simple to interpret and the predictor variables are not
previously assumed [66]. A major advantage of tree-based pattern
recognition algorithms is that the patterns found in the dataset are
not hidden from the user, as happens in artificial neural network
methods [64].

• Naive Bayes: the Naive Bayes algorithm is based on the Bayes
theorem [67]. Assuming feature independence, the correct class $c$ of
an instance is given by

$$c = \arg\max_{c_k} \left[ \log P(c_k) + \sum_{f_j \in F} \log P(f_j \mid c_k) \right],$$

where $c_k$ is one of the possible classes and $f_j \in F$ is a
particular feature. To compute the quantity $P(f_j \mid c_k)$ we
assumed that the likelihood of the features follows a bell-shaped
(normal) distribution [68].
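The Naive Bayes decision rule above, with a normal likelihood for each feature, can be sketched in plain Python as follows. This is an illustrative implementation rather than the software used in the experiments: the function names, the small variance floor `1e-9` and the toy data are assumptions.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate the log-prior and the per-feature mean and variance of each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small constant avoids division by zero for constant features (assumption).
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (math.log(n / len(X)), means, variances)
    return model

def predict_gaussian_nb(model, row):
    """Return argmax over classes of log P(c) + sum_j log P(f_j | c)."""
    def log_posterior(label):
        log_prior, means, variances = model[label]
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            for v, m, var in zip(row, means, variances)
        )
        return log_prior + log_likelihood
    return max(model, key=log_posterior)

# Hypothetical two-feature toy data.
X = [[1.0, 2.0], [1.2, 1.9], [3.0, 4.1], [3.1, 3.9]]
y = ["informative", "informative", "imaginative", "imaginative"]
nb_model = fit_gaussian_nb(X, y)
```

Working with log-probabilities, as in the formula above, avoids numerical underflow when many features are multiplied together.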

We have chosen the aforementioned methods because they yield good
performance when set with default parameters [69]. The evaluation of
the performance of the methods when set with default parameters was
performed with the “leave one out” algorithm [70]. This evaluation
procedure consists in selecting one element of the dataset to be used
as a test instance, while the remaining instances are used in the
training phase. This procedure is then repeated until all instances of
the dataset have been chosen as a test instance.
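The leave-one-out procedure can be sketched with a simple nearest-neighbors classifier. The fragment below is a minimal illustration under stated assumptions: Euclidean distance, `k = 3`, and a one-dimensional toy dataset; none of these settings is taken from the experiments reported here.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, row, k=3):
    """Majority class among the k training instances closest to `row` (Euclidean distance)."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], row))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def leave_one_out_accuracy(X, y, k=3):
    """Hold out each instance once, train on the rest, and return the fraction classified correctly."""
    hits = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        hits += knn_predict(train_X, train_y, X[i], k) == y[i]
    return hits / len(X)

# Hypothetical, well-separated toy data with a single feature.
X = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
y = ["a", "a", "a", "b", "b", "b"]
loo_accuracy = leave_one_out_accuracy(X, y)
```

Every instance is tested exactly once and is never part of its own training set, so the estimate uses all the data while keeping training and testing separate.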


Quantifying feature relevance 

To quantify the relevance of features for the classification task,
the following method was used. Let $F = \{f_1, f_2, \ldots\}$ be the
set of attributes, comprising $\Phi$ distinct attributes. We generate
a set $F_c$ comprising all $2^\Phi$ combinations of features. For a
particular classification method, we compute the accuracy rate
obtained for each classifier in $F_c$. The accuracy rate is then
employed to sort (in decreasing order) the classifiers in $F_c$. A
function is associated with each attribute in $F$:

$$\Omega(f_i, k) = \sum_{j=1}^{k} \omega(f_i, j), \qquad (13)$$

where $\omega(f_i, j) = 1$ if the $j$-th best classifier in $F_c$ used
the $i$-th attribute. If the $i$-th feature was not used in the $j$-th
best classifier, then $\omega(f_i, j) = 0$. Note that the function
$\Omega$ quantifies how frequent a given feature is in the best
classifiers. If this function grows fast for low values of $k$, then
the feature is relevant because it appears in the classifiers with the
highest accuracy rates. To quantify how frequent a given feature $f_i$
is among the best classifiers, the following index of feature
relevance can be defined:

$$R(f_i) = \sum_{k=1}^{2^\Phi - 1} \Omega(f_i, k) = \sum_{k=1}^{2^\Phi - 1} \sum_{j=1}^{k} \omega(f_i, j). \qquad (14)$$

Unlike traditional indices devised to measure the relevance of
attributes, the index defined in eq. (14) takes into account the
non-trivial inter-relationship between features [71].
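The relevance index can be sketched in Python as follows. The sketch enumerates all non-empty feature subsets, sorts them by accuracy and accumulates $\Omega(f_i, k)$ over $k$. It is illustrative only: `accuracy_of` stands for any caller-supplied function returning the accuracy of the classifier built on a subset, the toy accuracies are invented for the example, and ties keep their enumeration order because the text does not state how they are broken.

```python
from itertools import combinations

def feature_relevance(features, accuracy_of):
    """Compute R(f) = sum over k of Omega(f, k) for all non-empty feature subsets."""
    subsets = [c for r in range(1, len(features) + 1)
               for c in combinations(features, r)]
    # Sort classifiers (feature subsets) by decreasing accuracy.
    ranked = sorted(subsets, key=accuracy_of, reverse=True)
    relevance = {}
    for f in features:
        omega = 0   # Omega(f, k): occurrences of f among the k best classifiers
        total = 0   # running sum of Omega(f, k) over k
        for subset in ranked:
            omega += f in subset
            total += omega
        relevance[f] = total
    return relevance

# Hypothetical accuracies for the three non-empty subsets of two features.
toy_accuracy = {("sym",): 0.90, ("deg",): 0.55, ("sym", "deg"): 0.95}
relevance = feature_relevance(["sym", "deg"], toy_accuracy.get)
```

Here the subsets are ranked as (sym, deg), (sym), (deg), so the cumulative counts for "sym" are 1, 2, 2 and those for "deg" are 1, 1, 2. The number of subsets doubles with each added feature, so exhaustive enumeration is only feasible for small feature sets.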


IV. RESULTS AND DISCUSSION 

In this section, we analyze the proposed technique for discriminating
informative and imaginative prose. We also compare the proposed
technique with other traditional natural language processing methods.
In our experiments, we used the Brown University Standard Corpus of
Present-Day American English (a.k.a. Brown Corpus) [72]. Because the
set of informative texts comprises several short texts, for this class
we have selected only the 126 longest texts. As such, in our
experiments, each class is represented by the same number of
instances. Each class can also be divided into subclasses. The set of
informative texts used in this study comprises 80 scientific
manuscripts, 30 miscellaneous texts and 16 biographies and related
subjects. The set of imaginative documents comprises general fiction,
romances, love stories and others. Note that we have not used this
fine-grained description in our experiments.


Complex network approach 

Following the steps in the methodology, we created a word adjacency
network for each document in the dataset. The topological measurements
were extracted and the 15 most relevant features were selected
according to the information gain criterion. In the global strategy,
the following features have been selected (angle brackets denote the
average and a tilde denotes the median):

Vocabulary size: $V$;

Degree connectivity: $\langle k \rangle$;

PageRank: $\tilde{Pr}$, $\langle Pr \rangle$, $\sigma(Pr)$ and $\max(Pr)$;

Clustering coefficient: $\sigma(cc)$ and $\langle cc \rangle$;

Closeness centrality: $\min(C)$, $\sigma(C)$, $\langle C \rangle$ and $\tilde{C}$;

Generalized accessibility: $\langle Ag \rangle$; and

Betweenness centrality: $\langle B \rangle$ and $\tilde{B}$.

In this case, the accuracy rate reached a maximum value of 78% with
the Naive Bayes algorithm. Note that this result is statistically
significant, as the p-value associated with this accuracy rate is
$p < 1.0 \times 10^{-10}$. The accuracy rates obtained for the other
classifiers are shown in the first row of Table I.

When the local strategy (LS) was used to perform 
the classification, the accuracy rate improved by a large 
margin: all three classifiers reached an accuracy rate of 
92%. The largest improvement in performance occurred 
for the KNN classifier; the accuracy rate went from 72% 
to 92%. The features employed in this case were: 


Backbone symmetry: $Sb^{(h)}$, for $h = \{2,3,4\}$;

Merged symmetry: $Sm^{(h)}$, for $h = \{2,3,4\}$; and

Accessibility: $A^{(h)}$, for $h = \{2,3\}$.

Note that these local measurements were chosen because they do not
correlate with the frequency. The local strategy with stopwords (LSS)
displayed a slightly better classification performance. In this case,
the accuracy rate reached 95% with the KNN, CART and Naive Bayes
classifiers. The principal component analysis projection provided in
Fig. 3 confirms the suitability of this network model for
discriminating informative from imaginative prose. In this case, the
features employed were:

Backbone symmetry: $Sb^{(h)}$, for $h = \{2,3,4\}$;

Merged symmetry: $Sm^{(h)}$, for $h = \{2,3,4\}$;

Accessibility: $A^{(h)}$, for $h = 2$; and

Generalized accessibility: $Ag$.

TABLE I. Accuracy rate obtained with the three proposed
network approaches. Note that the most accurate results
occur when the

Complex network approach             KNN    CART   Bayes
Global strategy without stopwords    72%    78%    75%
Local strategy without stopwords     92%    92%    92%
Local strategy with stopwords        95%    95%    95%

FIG. 3. Principal component analysis performed using local
topological properties in networks formed with stopwords. To
create the visualization, only the features with no correlation
with the frequency were used.

Note that, in both local strategies, the accuracy rates in the
classification are much higher than the ones obtained with the global
strategy, which suggests that a few words account for the
informativeness of the topological approach. To better understand the
factors behind the network's ability to discriminate informative from
imaginative prose, we evaluated the relative importance of the
features employed in the best approach, i.e. the local strategy with
stopwords. The method employed to quantify the relevance of features
is described in Section III. According to this method, the most
relevant features, in decreasing order of relevance, were:

(i) Merged symmetry: $Sm^{(h=2)}$(the)

(ii) Merged symmetry: $Sm^{(h=3)}$(by)

(iii) Backbone symmetry: $Sb^{(h=4)}$(by)

(iv) Merged symmetry: $Sm^{(h=3)}$(an)

(v) Generalized accessibility: $Ag$(have)

(vi) Merged symmetry: $Sm^{(h=4)}$(by)

(vii) Generalized accessibility: $Ag$(it)

(viii) Generalized accessibility: $Ag$(by).

Note that the most relevant features are those related to the
symmetry of specific words. Interestingly, this is consistent with
recent results showing that symmetry measurements tend to be more
discriminative than other traditional network measurements [73].

It is relevant to highlight that most of the network approaches for
text classification focus on the global properties of networks. Our
results reveal, conversely, that the informativeness of the
topological strategy is concentrated in a few nodes. In particular,
the informativeness was found to be mostly hidden in the symmetry
patterns of specific function words. For this reason, we believe that
the local strategies (LS and LSS) could be useful not only for the
studied task, but also in several related tasks where the topology of
specific words plays a prominent role in characterizing texts.


Comparison with traditional methods 



FIG. 4. Latent semantic analysis performed to distinguish
informative from imaginative prose. The ten most frequent
words were used as features. Note that the style can be
identified by measuring the proximity to specific words. While state
and system characterize informative documents, say and mr.
characterize imaginative texts.


To compare the performance of the proposed technique
with other traditional techniques, we first analysed
whether the classes can be discriminated via Latent
Semantic Analysis [74], which uses word frequencies as
features. The projection obtained with this technique is
shown in Fig. 4. Note that a good discrimination was
obtained in this case, mainly because some words are more
common in informative documents (e.g. state, system
and program), while others occur more often in imaginative
texts (e.g. say and mr.). A more accurate classification
system based on stylistic attributes can be created
if one considers as features the frequency of the most
informative stopwords. To select the most informative
stopwords, we used the information gain criterion. Using
the ATNN classifier (the best classifier), the accuracy
reached 97%. This high accuracy level can be observed in
the principal component analysis provided in Fig. 5.
Another traditional strategy in stylometry consists in
counting the frequency of character bigrams [27].
Considering the most informative bigrams, the accuracy
rate reached 98%. A visualisation of the data provided
by this set of features is shown in Fig. 6.

FIG. 5. Principal component analysis performed to
distinguish informative from imaginative prose. The
frequencies of the most informative stopwords were used
as features. Note that the style developed in informative
documents is much more regular than the style observed
in imaginative texts.
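The feature selection step described above (stopword or
character-bigram frequencies ranked by information gain)
can be illustrated with a minimal sketch. The function
names are hypothetical, and the median split used to
binarise each frequency is an assumption: the text does
not state how continuous values were discretised.

```python
import math
from collections import Counter


def relative_frequencies(tokens, vocabulary):
    """Relative frequency of each vocabulary item in one document."""
    counts = Counter(tokens)
    total = len(tokens) or 1  # guard against empty documents
    return [counts[item] / total for item in vocabulary]


def char_bigrams(text):
    """Overlapping character bigrams of a string."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Information gain of one feature, binarised at its median (assumption)."""
    threshold = sorted(values)[len(values) // 2]
    low = [y for v, y in zip(values, labels) if v < threshold]
    high = [y for v, y in zip(values, labels) if v >= threshold]
    conditional = sum(len(part) / len(labels) * entropy(part)
                      for part in (low, high) if part)
    return entropy(labels) - conditional


def rank_features(documents, labels, vocabulary, tokenize):
    """Vocabulary items sorted by decreasing information gain."""
    rows = [relative_frequencies(tokenize(doc), vocabulary)
            for doc in documents]
    gains = {item: information_gain([row[j] for row in rows], labels)
             for j, item in enumerate(vocabulary)}
    return sorted(gains.items(), key=lambda pair: pair[1], reverse=True)
```

Passing `str.split` as `tokenize` with a stopword list as
`vocabulary` ranks stopwords; passing `char_bigrams` with
a list of bigrams ranks bigrams. The top-ranked items
would then serve as the features given to the classifier.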

FIG. 6. Principal component analysis performed to
distinguish informative from imaginative prose. The
frequencies of character bigrams were used as features.
As in Fig. 5, the variability of styles is much higher
in imaginative texts.

All in all, the results obtained with traditional
classifiers demonstrate that the local topological
approach is effective, as our best result differs by only
3 percentage points from the most efficient traditional
system. This result is consistent with similar studies
showing that the topology plays a relevant role in
characterising complex systems, especially those
conveying information [2, 12, 37, 44]. Because the
proposed representation is complementary to the
traditional approaches, we advocate that the combination
of features of distinct nature (traditional and
topological) could lead to the improvement of similar
tasks relying on the accurate characterisation of
stylistic marks.


V. CONCLUSION 

In this paper, we have evaluated the ability of network
measurements to identify two textual categories, namely
informative and imaginative documents. We have extended
previous models in two ways. First, the local topology
of nodes representing specific words was studied. We
have thus emphasized particular network regions to
characterize the local topology of texts. This approach
differs from previous networked representations because
traditional topological analyses give equal relevance to
all nodes of the network. The second extension is the
use of novel network measurements that are able to grasp
more relevant information than traditional measurements.
In particular, we have used symmetry measurements, which
quantify the homogeneity of access to neighbours. The
concept of node degree was also extended via the
introduction of accessibility measurements, which
measure the effective number of (accessed) neighbours.
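The idea of an "effective number of accessed neighbours"
can be made concrete with a minimal sketch. It assumes
the common definition of accessibility as the exponential
of the entropy of the probabilities of reaching each node
after h steps, and it assumes an ordinary random walk on
a weighted adjacency matrix; the walk dynamics and the
generalized variant used in this paper may differ. The
function names are hypothetical.

```python
import math


def transition_matrix(adjacency):
    """Row-normalise an adjacency matrix into one-step walk probabilities."""
    matrix = []
    for row in adjacency:
        total = sum(row)
        matrix.append([w / total if total else 0.0 for w in row])
    return matrix


def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def accessibility(adjacency, node, h=1):
    """exp(entropy) of the h-step arrival distribution starting at `node`."""
    step = transition_matrix(adjacency)
    reach = step
    for _ in range(h - 1):
        reach = mat_mul(reach, step)
    entropy = -sum(p * math.log(p) for p in reach[node] if p > 0)
    return math.exp(entropy)
```

Under these assumptions, a node whose three neighbours
are reached with equal probability at h = 1 has an
accessibility of about 3, matching its degree, whereas
unequal edge weights yield a smaller value. This is the
sense in which the measurement extends the node degree.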

Computational simulations revealed that the proposed
extensions are able to improve the efficiency of
classification tasks. The best improvement in performance
when comparing the traditional model and the proposed
method occurred with the ATNN classifier. An improvement
of 23% was observed, thus confirming the efficiency
of the proposed methodology. A systematic analysis of
feature relevance revealed that among the most
informative attributes are the symmetry and accessibility
indexes applied to particular nodes. These results confirm
the complementary role played by these measurements in
characterizing text networks, since they do not correlate
with traditional natural language processing methods.
Owing to the generality of the proposed representation
and characterization, we believe that it could be extended
to a myriad of related applications where the
quantification of style is relevant for text
categorization. As future work, we intend to combine
network methods and traditional statistical methods to
improve the performance of the classification. We expect,
in this case, that the interwoven combination of
methodologies will be able to overcome the limitations
of each technique.